Setup

Note: before knitting, please install all missing packages from the code cell below into your environment. Also, I recommend knitting to HTML, as this document has been optimized for viewing in that format.

R Packages used:

Please use install.packages(‘…’) to install any that you are missing.
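
A quick way to check what is missing (the `pkgs` vector simply lists the packages loaded below; the install line is left commented out so nothing is installed without your say-so):

```r
# Packages this analysis loads
pkgs <- c('tidyverse', 'lubridate', 'ggmap', 'ggplot2', 'gridExtra')

# Which of them are not yet installed?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs   # character(0) once everything is installed

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```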

If you have trouble knitting due to issues installing these packages, you can view the knitted version as follows:

From the repository >> NYPD_Shooting_Incidents.html >> Download (view raw) >> Right click anywhere >> Save As… >> Open

Data Source Information

This data was procured from https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic. It is updated annually at the source, and the latest version is pulled any time this .rmd file is knit.

From the source: “List of every shooting incident that occurred in NYC going back to 2006 through the end of the previous calendar year.

This is a breakdown of every shooting incident that occurred in NYC going back to 2006 through the end of the previous calendar year. This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning before being posted on the NYPD website. Each record represents a shooting incident in NYC and includes information about the event, the location and time of occurrence. In addition, information related to suspect and victim demographics is also included. This data can be used by the public to explore the nature of shooting/criminal activity. Please refer to the attached data footnotes for additional information about this dataset.”

For more information on the details of this dataset, it is recommended to follow the link above and access the footnotes PDF found on the landing page.

Environment Setup

We will first begin by loading in the R packages we intend to use.

Then, we will import the data using a URL directly from the source; this ensures we capture any updates to the data whenever this document is knit again.

# Output all commands run and set a standard plot size
knitr::opts_chunk$set(echo = TRUE, fig.width = 10, fig.height = 6)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
import_url <- read.csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD")

Transformation and Exploratory Data Analysis (EDA)

Let’s take a look at the dimensions of this imported data.frame, as well as the variable types of each column, and output a summary.

dim(import_url)
## [1] 27312    21
str(import_url)
## 'data.frame':    27312 obs. of  21 variables:
##  $ INCIDENT_KEY           : int  228798151 137471050 147998800 146837977 58921844 219559682 85295722 71662474 83002139 86437261 ...
##  $ OCCUR_DATE             : chr  "05/27/2021" "06/27/2014" "11/21/2015" "10/09/2015" ...
##  $ OCCUR_TIME             : chr  "21:30:00" "17:40:00" "03:56:00" "18:30:00" ...
##  $ BORO                   : chr  "QUEENS" "BRONX" "QUEENS" "BRONX" ...
##  $ LOC_OF_OCCUR_DESC      : chr  "" "" "" "" ...
##  $ PRECINCT               : int  105 40 108 44 47 81 114 81 105 101 ...
##  $ JURISDICTION_CODE      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LOC_CLASSFCTN_DESC     : chr  "" "" "" "" ...
##  $ LOCATION_DESC          : chr  "" "" "" "" ...
##  $ STATISTICAL_MURDER_FLAG: chr  "false" "false" "true" "false" ...
##  $ PERP_AGE_GROUP         : chr  "" "" "" "" ...
##  $ PERP_SEX               : chr  "" "" "" "" ...
##  $ PERP_RACE              : chr  "" "" "" "" ...
##  $ VIC_AGE_GROUP          : chr  "18-24" "18-24" "25-44" "<18" ...
##  $ VIC_SEX                : chr  "M" "M" "M" "M" ...
##  $ VIC_RACE               : chr  "BLACK" "BLACK" "WHITE" "WHITE HISPANIC" ...
##  $ X_COORD_CD             : num  1058925 1005028 1007668 1006537 1024922 ...
##  $ Y_COORD_CD             : num  180924 234516 209837 244511 262189 ...
##  $ Latitude               : num  40.7 40.8 40.7 40.8 40.9 ...
##  $ Longitude              : num  -73.7 -73.9 -73.9 -73.9 -73.9 ...
##  $ Lon_Lat                : chr  "POINT (-73.73083868899994 40.662964620000025)" "POINT (-73.92494232599995 40.81035186300006)" "POINT (-73.91549174199997 40.74260663300004)" "POINT (-73.91945661499994 40.83778200300003)" ...
summary(import_url)
##   INCIDENT_KEY        OCCUR_DATE         OCCUR_TIME            BORO          
##  Min.   :  9953245   Length:27312       Length:27312       Length:27312      
##  1st Qu.: 63860880   Class :character   Class :character   Class :character  
##  Median : 90372218   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :120860536                                                           
##  3rd Qu.:188810230                                                           
##  Max.   :261190187                                                           
##                                                                              
##  LOC_OF_OCCUR_DESC     PRECINCT      JURISDICTION_CODE LOC_CLASSFCTN_DESC
##  Length:27312       Min.   :  1.00   Min.   :0.0000    Length:27312      
##  Class :character   1st Qu.: 44.00   1st Qu.:0.0000    Class :character  
##  Mode  :character   Median : 68.00   Median :0.0000    Mode  :character  
##                     Mean   : 65.64   Mean   :0.3269                      
##                     3rd Qu.: 81.00   3rd Qu.:0.0000                      
##                     Max.   :123.00   Max.   :2.0000                      
##                                      NA's   :2                           
##  LOCATION_DESC      STATISTICAL_MURDER_FLAG PERP_AGE_GROUP    
##  Length:27312       Length:27312            Length:27312      
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##    PERP_SEX          PERP_RACE         VIC_AGE_GROUP        VIC_SEX         
##  Length:27312       Length:27312       Length:27312       Length:27312      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    VIC_RACE           X_COORD_CD        Y_COORD_CD        Latitude    
##  Length:27312       Min.   : 914928   Min.   :125757   Min.   :40.51  
##  Class :character   1st Qu.:1000028   1st Qu.:182834   1st Qu.:40.67  
##  Mode  :character   Median :1007731   Median :194487   Median :40.70  
##                     Mean   :1009449   Mean   :208127   Mean   :40.74  
##                     3rd Qu.:1016838   3rd Qu.:239518   3rd Qu.:40.82  
##                     Max.   :1066815   Max.   :271128   Max.   :40.91  
##                                                        NA's   :10     
##    Longitude        Lon_Lat         
##  Min.   :-74.25   Length:27312      
##  1st Qu.:-73.94   Class :character  
##  Median :-73.92   Mode  :character  
##  Mean   :-73.91                     
##  3rd Qu.:-73.88                     
##  Max.   :-73.70                     
##  NA's   :10

Feature Removal and Renaming

Looks like we have 21 columns (features) and 27312 rows (data points).
First, let’s remove any features that we won’t be needing for our analysis.

  1. JURISDICTION_CODE is too broad for localizing shooting incidents, so we will use BORO instead, which gives our analysis more insight.
  2. X_COORD_CD, Y_COORD_CD, and Lon_Lat are all redundant, as we will use Longitude and Latitude in their place.

Also, let’s rename a few of these for more readability.

# Remove features
import_url <- select(import_url, -JURISDICTION_CODE, -X_COORD_CD, -Y_COORD_CD, -Lon_Lat)
# Rename features
import_url <- import_url %>%
        rename(c('DATE' = 'OCCUR_DATE', 'TIME' = 'OCCUR_TIME','BOROUGH' = 'BORO', 
                'LOCATION' = 'LOCATION_DESC', 'MURDER_FLAG' = 'STATISTICAL_MURDER_FLAG', 
                'PERP_AGE' = 'PERP_AGE_GROUP', 'VICTIM_AGE' = 'VIC_AGE_GROUP', 'VICTIM_SEX' = 'VIC_SEX',
                'VICTIM_RACE' = 'VIC_RACE', 'LATITUDE' = 'Latitude', 'LONGITUDE' = 'Longitude'))
head(import_url)
##   INCIDENT_KEY       DATE     TIME  BOROUGH LOC_OF_OCCUR_DESC PRECINCT
## 1    228798151 05/27/2021 21:30:00   QUEENS                        105
## 2    137471050 06/27/2014 17:40:00    BRONX                         40
## 3    147998800 11/21/2015 03:56:00   QUEENS                        108
## 4    146837977 10/09/2015 18:30:00    BRONX                         44
## 5     58921844 02/19/2009 22:58:00    BRONX                         47
## 6    219559682 10/21/2020 21:36:00 BROOKLYN                         81
##   LOC_CLASSFCTN_DESC LOCATION MURDER_FLAG PERP_AGE PERP_SEX PERP_RACE
## 1                                   false                            
## 2                                   false                            
## 3                                    true                            
## 4                                   false                            
## 5                                    true    25-44        M     BLACK
## 6                                    true                            
##   VICTIM_AGE VICTIM_SEX    VICTIM_RACE LATITUDE LONGITUDE
## 1      18-24          M          BLACK 40.66296 -73.73084
## 2      18-24          M          BLACK 40.81035 -73.92494
## 3      25-44          M          WHITE 40.74261 -73.91549
## 4        <18          M WHITE HISPANIC 40.83778 -73.91946
## 5      45-64          M          BLACK 40.88624 -73.85291
## 6      25-44          M          BLACK 40.67846 -73.92795

Check for Duplicates and Remove

Next, we will check if there are any missing or duplicated data points, focusing only on the INCIDENT_KEY feature for now. This feature will be the most important for identifying any duplicate entries as they should all be unique.

# Check for any NA or Null values
any(is.na(import_url$INCIDENT_KEY)) | any(is.null(import_url$INCIDENT_KEY))
## [1] FALSE
# Check for duplicates
length(unique(import_url$INCIDENT_KEY))
## [1] 21420
length(import_url$INCIDENT_KEY)
## [1] 27312

Subtracting the results here shows that there are 5892 duplicate data points! Let’s take a closer look to make sure these aren’t false positives.

# Query duplicates to see what they look like
head(filter(import_url, duplicated(import_url$INCIDENT_KEY)))
##   INCIDENT_KEY       DATE     TIME  BOROUGH LOC_OF_OCCUR_DESC PRECINCT
## 1     85875439 07/22/2012 21:35:00    BRONX                         42
## 2     23749375 08/07/2006 22:06:00 BROOKLYN                         79
## 3    228109390 05/12/2021 23:11:00 BROOKLYN                         83
## 4    140697084 01/24/2015 05:40:00   QUEENS                        105
## 5     85875439 07/22/2012 21:35:00    BRONX                         42
## 6    142472725 04/27/2015 20:27:00 BROOKLYN                         70
##   LOC_CLASSFCTN_DESC                  LOCATION MURDER_FLAG PERP_AGE PERP_SEX
## 1                    MULTI DWELL - PUBLIC HOUS       false    18-24        M
## 2                    MULTI DWELL - PUBLIC HOUS        true      <18        M
## 3                                                    false    25-44        M
## 4                                    PVT HOUSE        true    25-44        M
## 5                    MULTI DWELL - PUBLIC HOUS       false      <18        M
## 6                                                     true    25-44        M
##   PERP_RACE VICTIM_AGE VICTIM_SEX VICTIM_RACE LATITUDE LONGITUDE
## 1     BLACK      25-44          M       BLACK 40.82488 -73.90318
## 2     BLACK        <18          M       BLACK 40.69717 -73.94375
## 3     BLACK      18-24          M       BLACK 40.68822 -73.91981
## 4     BLACK        <18          F       BLACK 40.65561 -73.75097
## 5     BLACK      25-44          M       BLACK 40.82488 -73.90318
## 6     BLACK      25-44          M       BLACK 40.63728 -73.95238
# Check a few entries
arrange(filter(import_url, INCIDENT_KEY == 227647476 | INCIDENT_KEY == 232390408), INCIDENT_KEY)
##   INCIDENT_KEY       DATE     TIME   BOROUGH LOC_OF_OCCUR_DESC PRECINCT
## 1    227647476 05/02/2021 18:18:00 MANHATTAN                         23
## 2    227647476 05/02/2021 18:18:00 MANHATTAN                         23
## 3    227647476 05/02/2021 18:18:00 MANHATTAN                         23
## 4    232390408 08/17/2021 22:20:00  BROOKLYN                         73
## 5    232390408 08/17/2021 22:20:00  BROOKLYN                         73
##   LOC_CLASSFCTN_DESC                  LOCATION MURDER_FLAG PERP_AGE PERP_SEX
## 1                    MULTI DWELL - PUBLIC HOUS       false      <18        M
## 2                    MULTI DWELL - PUBLIC HOUS       false    18-24        M
## 3                    MULTI DWELL - PUBLIC HOUS       false    18-24        M
## 4                               GROCERY/BODEGA       false                  
## 5                               GROCERY/BODEGA       false                  
##        PERP_RACE VICTIM_AGE VICTIM_SEX VICTIM_RACE LATITUDE LONGITUDE
## 1          BLACK      25-44          M       BLACK 40.78694 -73.94357
## 2 WHITE HISPANIC      25-44          M       BLACK 40.78694 -73.94357
## 3          BLACK      25-44          M       BLACK 40.78694 -73.94357
## 4                     25-44          M       BLACK 40.66835 -73.90652
## 5                     18-24          M       BLACK 40.66835 -73.90652
# Yes those are definitely duplicates
# Remove duplicates
import_url <- filter(import_url, !duplicated(import_url$INCIDENT_KEY))
# Check work: count remaining duplicated keys (should be zero)
sum(duplicated(import_url$INCIDENT_KEY))
## [1] 0

That should do it for the duplicated data points. Let’s continue our transformations.
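
As a side note on why this works: duplicated() marks every occurrence after the first, so negating it keeps exactly one row per key. A minimal base-R sketch on a toy data frame (dplyr::distinct(…, .keep_all = TRUE) is an equivalent one-liner):

```r
# Toy data frame with one repeated incident key (illustrative values only)
toy <- data.frame(INCIDENT_KEY = c(1, 2, 2, 3),
                  BOROUGH      = c('BRONX', 'QUEENS', 'QUEENS', 'BRONX'))

# Negating duplicated() keeps the first row for each key
deduped <- toy[!duplicated(toy$INCIDENT_KEY), ]
nrow(deduped)                           # 3
sum(duplicated(deduped$INCIDENT_KEY))   # 0
```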

Change Feature Class Types

For better analysis we should change the class type of a few of these features to make them easier to work with.

# Character to Date and Period
import_url <- import_url %>%
        mutate(DATE = mdy(DATE)) %>%
        mutate(TIME = hms(TIME))

# Character to Factors - changes all character columns to factor
import_url <- import_url %>%
        mutate(across(where(is.character), as.factor))
str(import_url)
## 'data.frame':    21420 obs. of  17 variables:
##  $ INCIDENT_KEY      : int  228798151 137471050 147998800 146837977 58921844 219559682 85295722 71662474 83002139 86437261 ...
##  $ DATE              : Date, format: "2021-05-27" "2014-06-27" ...
##  $ TIME              :Formal class 'Period' [package "lubridate"] with 6 slots
##   .. ..@ .Data : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ year  : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ month : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ day   : num  0 0 0 0 0 0 0 0 0 0 ...
##   .. ..@ hour  : num  21 17 3 18 22 21 22 19 5 1 ...
##   .. ..@ minute: num  30 40 56 30 58 36 47 41 45 10 ...
##  $ BOROUGH           : Factor w/ 5 levels "BRONX","BROOKLYN",..: 4 1 4 1 1 2 4 2 4 4 ...
##  $ LOC_OF_OCCUR_DESC : Factor w/ 3 levels "","INSIDE","OUTSIDE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PRECINCT          : int  105 40 108 44 47 81 114 81 105 101 ...
##  $ LOC_CLASSFCTN_DESC: Factor w/ 10 levels "","COMMERCIAL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LOCATION          : Factor w/ 41 levels "","(null)","ATM",..: 1 1 1 1 1 1 1 1 1 26 ...
##  $ MURDER_FLAG       : Factor w/ 2 levels "false","true": 1 1 2 1 2 2 1 2 1 1 ...
##  $ PERP_AGE          : Factor w/ 11 levels "","(null)","<18",..: 1 1 1 1 7 1 1 1 1 7 ...
##  $ PERP_SEX          : Factor w/ 5 levels "","(null)","F",..: 1 1 1 1 4 1 1 1 1 4 ...
##  $ PERP_RACE         : Factor w/ 9 levels "","(null)","AMERICAN INDIAN/ALASKAN NATIVE",..: 1 1 1 1 5 1 1 1 1 5 ...
##  $ VICTIM_AGE        : Factor w/ 7 levels "<18","1022","18-24",..: 3 3 4 1 5 4 4 3 4 4 ...
##  $ VICTIM_SEX        : Factor w/ 3 levels "F","M","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ VICTIM_RACE       : Factor w/ 7 levels "AMERICAN INDIAN/ALASKAN NATIVE",..: 3 3 6 7 3 3 3 3 3 3 ...
##  $ LATITUDE          : num  40.7 40.8 40.7 40.8 40.9 ...
##  $ LONGITUDE         : num  -73.7 -73.9 -73.9 -73.9 -73.9 ...

More Feature Checks

We will continue to look at the features and see if any of these blank entries will cause trouble during the analysis. Also, we’ll check whether any features contain duplicate or inconsistent factor levels.

# Create a table of each column to check factor levels
for (i in 1:length(import_url)){
    if (is.factor(import_url[ ,i])) print(table(import_url[ ,i, drop = FALSE]))
}
## BOROUGH
##         BRONX      BROOKLYN     MANHATTAN        QUEENS STATEN ISLAND 
##          6019          8806          2747          3229           619 
## LOC_OF_OCCUR_DESC
##          INSIDE OUTSIDE 
##   20126     181    1113 
## LOC_CLASSFCTN_DESC
##              COMMERCIAL    DWELLING     HOUSING       OTHER PARKING LOT 
##       20126          64         104         225          23           4 
##  PLAYGROUND      STREET     TRANSIT     VEHICLE 
##          18         829           9          18 
## LOCATION
##                                              (null)                       ATM 
##                     11964                       749                         1 
##                      BANK            BAR/NIGHT CLUB         BEAUTY/NAIL SALON 
##                         2                       425                        80 
##               CANDY STORE               CHAIN STORE                CHECK CASH 
##                         7                         5                         1 
##         CLOTHING BOUTIQUE           COMMERCIAL BLDG                DEPT STORE 
##                        11                       214                         5 
##            DOCTOR/DENTIST                DRUG STORE       DRY CLEANER/LAUNDRY 
##                         1                         8                        24 
##         FACTORY/WAREHOUSE                 FAST FOOD               GAS STATION 
##                         6                        81                        56 
##            GROCERY/BODEGA      GYM/FITNESS FACILITY                  HOSPITAL 
##                       514                         3                        54 
##               HOTEL/MOTEL             JEWELRY STORE              LIQUOR STORE 
##                        28                         9                        26 
##              LOAN COMPANY   MULTI DWELL - APT BUILD MULTI DWELL - PUBLIC HOUS 
##                         1                      2139                      3883 
##                      NONE          PHOTO/COPY STORE                 PVT HOUSE 
##                       140                         1                       688 
##          RESTAURANT/DINER                    SCHOOL                SHOE STORE 
##                       159                         1                         5 
##            SMALL MERCHANT SOCIAL CLUB/POLICY LOCATI          STORAGE FACILITY 
##                        22                        44                         1 
##        STORE UNCLASSIFIED               SUPERMARKET           TELECOMM. STORE 
##                        29                        15                         5 
##             VARIETY STORE               VIDEO STORE 
##                        10                         3 
## MURDER_FLAG
## false  true 
## 17687  3733 
## PERP_AGE
##          (null)     <18    1020   18-24     224   25-44   45-64     65+     940 
##    8120     525    1094       1    4372       1    4097     427      46       1 
## UNKNOWN 
##    2736 
## PERP_SEX
##        (null)      F      M      U 
##   8090    525    238  11217   1350 
## PERP_RACE
##                                                        (null) 
##                           8090                            525 
## AMERICAN INDIAN/ALASKAN NATIVE       ASIAN / PACIFIC ISLANDER 
##                              2                             98 
##                          BLACK                 BLACK HISPANIC 
##                           8430                            899 
##                        UNKNOWN                          WHITE 
##                           1621                            218 
##                 WHITE HISPANIC 
##                           1537 
## VICTIM_AGE
##     <18    1022   18-24   25-44   45-64     65+ UNKNOWN 
##    2106       1    7917    9815    1409     134      38 
## VICTIM_SEX
##     F     M     U 
##  1663 19750     7 
## VICTIM_RACE
## AMERICAN INDIAN/ALASKAN NATIVE       ASIAN / PACIFIC ISLANDER 
##                              7                            283 
##                          BLACK                 BLACK HISPANIC 
##                          15632                           1988 
##                        UNKNOWN                          WHITE 
##                             50                            540 
##                 WHITE HISPANIC 
##                           2920

Looks like there are quite a few blank entries and a few labeled as “UNKNOWN”.

  • Over 50% of the LOCATION and PERP_AGE data is unknown.
  • BOROUGH, MURDER_FLAG, PRECINCT, DATE, TIME, LONGITUDE, and LATITUDE have no missing entries.
  • The remaining features are missing a few entries, but not a significant amount relative to the overall size of the dataset.

We will combine these by labeling all blanks as “UNKNOWN”. This data likely comes from officers on the scene who may have (1) missed some information, (2) lacked a witness and/or not caught the offender, or (3) not previously recorded this particular field, which is now captured due to process changes. It’s reasonable to leave these data points in, because the missing values themselves tell us something about each incident. Removing a data point because of its empty entries would also be a mistake, since the fields that are complete still hold relevance; it would be throwing the baby out with the bathwater (e.g. removing a data point that is missing only the LOCATION description would hurt our analysis, especially considering BOROUGH, LONGITUDE, and LATITUDE aren’t missing).
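
The per-column shares quoted above come down to counting placeholder values; here is a minimal sketch on a toy column (the placeholder set '', '(null)', 'U', 'UNKNOWN' mirrors the values seen in the tables above, and the helper name unknown_share is our own):

```r
# Share of placeholder ("unknown") entries in a vector
unknown_share <- function(x) {
    mean(as.character(x) %in% c('', '(null)', 'U', 'UNKNOWN'))
}

# Toy example: 2 of the 4 entries are placeholders
toy_perp_age <- factor(c('18-24', '', 'UNKNOWN', '25-44'))
unknown_share(toy_perp_age)   # 0.5
```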

Here we will combine and correct any missing entries with the methods discussed above.

for (i in 1:length(import_url)){
    # IF the column is a factor AND contains an empty string OR a 'U'
    if (is.factor(import_url[ ,i]) && ('' %in% import_url[ ,i] || 'U' %in% import_url[ ,i])){
        # Add a level named UNKNOWN
        levels(import_url[, i]) <- c(levels(import_url[, i]), 'UNKNOWN')
        # Replace empty strings and 'U' values with UNKNOWN
        import_url[, i][import_url[, i] == ''] <- as.factor('UNKNOWN')
        import_url[, i][import_url[, i] == 'U'] <- as.factor('UNKNOWN')
        # Remove unused levels
        import_url[, i] <- droplevels(import_url[, i])
    }
}

Lastly, there are a few values in PERP_AGE that look like data entry typos (‘1020’, ‘224’, ‘940’ in the table above). Since we cannot know what was intended, we will change these age values to ‘UNKNOWN’.

# Before changes
table(import_url$PERP_AGE)
## 
##  (null)     <18    1020   18-24     224   25-44   45-64     65+     940 UNKNOWN 
##     525    1094       1    4372       1    4097     427      46       1   10856
# Set values we want to keep as levels
age_levels <- c('<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN')
# Find all values NOT in age_levels (notice the !)
import_url$PERP_AGE[!import_url$PERP_AGE %in% age_levels] <- as.factor('UNKNOWN')
# Remove unused levels
import_url$PERP_AGE <- droplevels(import_url$PERP_AGE)
# After changes
table(import_url$PERP_AGE)
## 
##     <18   18-24   25-44   45-64     65+ UNKNOWN 
##    1094    4372    4097     427      46   11384

EDA Cont.

Let’s take a look at a summary of this data now that we’ve cleaned it up.

summary(import_url)
##   INCIDENT_KEY            DATE                 TIME                          
##  Min.   :  9953245   Min.   :2006-01-01   Min.   :0S                         
##  1st Qu.: 64394528   1st Qu.:2009-08-02   1st Qu.:3H 28M 0S                  
##  Median : 91165008   Median :2013-06-14   Median :15H 5M 0S                  
##  Mean   :121166392   Mean   :2014-01-17   Mean   :12H 41M 9.29691876750439S  
##  3rd Qu.:188062788   3rd Qu.:2018-09-25   3rd Qu.:20H 43M 0S                 
##  Max.   :261190187   Max.   :2022-12-31   Max.   :23H 59M 0S                 
##                                                                              
##           BOROUGH     LOC_OF_OCCUR_DESC    PRECINCT       LOC_CLASSFCTN_DESC
##  BRONX        :6019   INSIDE :  181     Min.   :  1.00   UNKNOWN   :20126   
##  BROOKLYN     :8806   OUTSIDE: 1113     1st Qu.: 44.00   STREET    :  829   
##  MANHATTAN    :2747   UNKNOWN:20126     Median : 69.00   HOUSING   :  225   
##  QUEENS       :3229                     Mean   : 66.12   DWELLING  :  104   
##  STATEN ISLAND: 619                     3rd Qu.: 81.00   COMMERCIAL:   64   
##                                         Max.   :123.00   OTHER     :   23   
##                                                          (Other)   :   49   
##                       LOCATION     MURDER_FLAG      PERP_AGE    
##  UNKNOWN                  :11964   false:17687   <18    : 1094  
##  MULTI DWELL - PUBLIC HOUS: 3883   true : 3733   18-24  : 4372  
##  MULTI DWELL - APT BUILD  : 2139                 25-44  : 4097  
##  (null)                   :  749                 45-64  :  427  
##  PVT HOUSE                :  688                 65+    :   46  
##  GROCERY/BODEGA           :  514                 UNKNOWN:11384  
##  (Other)                  : 1483                                
##     PERP_SEX              PERP_RACE      VICTIM_AGE     VICTIM_SEX   
##  (null) :  525   UNKNOWN       :9711   <18    :2106   F      : 1663  
##  F      :  238   BLACK         :8430   1022   :   1   M      :19750  
##  M      :11217   WHITE HISPANIC:1537   18-24  :7917   UNKNOWN:    7  
##  UNKNOWN: 9440   BLACK HISPANIC: 899   25-44  :9815                  
##                  (null)        : 525   45-64  :1409                  
##                  WHITE         : 218   65+    : 134                  
##                  (Other)       : 100   UNKNOWN:  38                  
##                          VICTIM_RACE       LATITUDE       LONGITUDE     
##  AMERICAN INDIAN/ALASKAN NATIVE:    7   Min.   :40.51   Min.   :-74.25  
##  ASIAN / PACIFIC ISLANDER      :  283   1st Qu.:40.67   1st Qu.:-73.94  
##  BLACK                         :15632   Median :40.70   Median :-73.92  
##  BLACK HISPANIC                : 1988   Mean   :40.74   Mean   :-73.91  
##  UNKNOWN                       :   50   3rd Qu.:40.82   3rd Qu.:-73.88  
##  WHITE                         :  540   Max.   :40.91   Max.   :-73.70  
##  WHITE HISPANIC                : 2920   NA's   :9       NA's   :9

Changing many of these features from characters into factors really improves R’s ability to summarize the data. We can draw several conclusions about this data set just from looking at the summary.

  • The earliest data point is from January 1, 2006.
  • The most recent is from December 31, 2022.
  • The distribution of DATE is right-skewed, with the mean about seven months later than the median. This suggests there were more incidents in the earlier years of the dataset than in the later ones.
  • Out of the 21420 shooting incidents, 3733 are flagged as murders, ~17.4%.
  • The majority of identified perpetrators are between the ages of 18 and 44 (the 18-24 and 25-44 groups).
  • The vast majority of identified perpetrators are male.
  • Victims’ ages are a little more evenly distributed than perpetrators’, but the vast majority are also between 18 and 44.
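
The murder share in the bullets above is just the mean of a logical comparison; a quick sketch with toy values (the real call would substitute import_url$MURDER_FLAG):

```r
# MURDER_FLAG is stored as the strings 'true'/'false'; the murder share
# is the proportion of 'true' values (toy data shown here)
toy_flag <- factor(c('false', 'false', 'true', 'false'))
mean(toy_flag == 'true')   # 0.25
```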

Let’s check the DATE distribution’s skewness by finding the midpoint date of the data set (the exact middle of its date range) and comparing it to the median.

data_interval <- interval(min(import_url$DATE), max(import_url$DATE))
int_start(data_interval) + (int_end(data_interval) - int_start(data_interval)) / 2
## [1] "2014-07-02 UTC"
median(import_url$DATE)
## [1] "2013-06-14"

Interesting: the median falls more than a year before the midpoint, which confirms that more than half of the incidents occurred in the first half of the dataset’s time interval. We will look into this further during visualizations.
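
Another way to quantify this skew is the share of incidents falling before the interval midpoint; a minimal base-R sketch with toy dates (the real version would substitute import_url$DATE):

```r
# Toy dates clustered toward the start of the range
toy_dates <- as.Date(c('2006-03-01', '2007-06-15', '2008-01-10',
                       '2010-05-05', '2020-11-20'))
# Midpoint of the date range, as computed above with lubridate intervals
midpoint <- min(toy_dates) + (max(toy_dates) - min(toy_dates)) / 2
mean(toy_dates < midpoint)   # 0.8: most of the toy incidents fall in the first half
```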

Visualizations

Now that the transformations are complete, let’s start plotting these features against each other and visualizing our data so we can draw some conclusions and support them more clearly.

Let’s begin by visualizing the number of shooting incidents by the BOROUGH they occurred in.

# Order factor levels of BOROUGH based on frequency
import_url$BOROUGH <- fct_infreq(import_url$BOROUGH)
# Plot graph
import_url %>%
   ggplot(., aes(x = reorder(BOROUGH, BOROUGH, length, decreasing = TRUE), fill = BOROUGH)) +
   geom_bar(aes(y = after_stat(count))) +
   scale_fill_brewer(palette = 'YlOrRd', direction = -1) +
   theme(axis.text.x = element_text(angle = 25, hjust = 1)) +
   labs(title = 'Number of Shooting Incidents by Borough',
        x = 'Borough', y = 'Number of Incidents',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')

As you can see, the largest share of shooting incidents occurred in Brooklyn, followed by the Bronx and Queens.
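
The ordering shown in the plot can also be read directly from a sorted frequency table; a sketch with toy values (the real call would use import_url$BOROUGH):

```r
toy_borough <- c('BROOKLYN', 'BRONX', 'BROOKLYN', 'QUEENS', 'BROOKLYN', 'BRONX')
# table() counts each category; sort() puts the most frequent borough first
sort(table(toy_borough), decreasing = TRUE)
```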

Now to plot the number of shooting incidents organized by the LOCATION description assigned to them.

import_url %>%
   ggplot(., aes(x = reorder(LOCATION, LOCATION, length, decreasing = TRUE), fill = LOCATION)) + 
   geom_bar(aes(y = after_stat(count)), show.legend = FALSE) +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
   scale_y_log10() +
   labs(title = 'Number of Shooting Incidents by Location Description',
        x = 'Location Description', y = 'Number of Incidents (Log Scaled)',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')

Something to note: this graph’s y-axis has been logarithmically scaled for easier viewing, since the location counts span several orders of magnitude.
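
Log scaling is one way to handle the long tail of rare location categories; another option (not used above) is lumping infrequent levels into an ‘Other’ bucket, as forcats::fct_lump_n() does. A base-R sketch of the idea, with toy values:

```r
toy_loc <- c('STREET', 'STREET', 'STREET', 'BODEGA', 'BODEGA', 'ATM', 'BANK')
# Keep the two most frequent categories; lump everything else into 'Other'
counts <- sort(table(toy_loc), decreasing = TRUE)
keep   <- names(counts)[1:2]
lumped <- ifelse(toy_loc %in% keep, toy_loc, 'Other')
table(lumped)
```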

# Group by month and total # of incidents
inc_month_totals <- import_url %>%
    group_by(month_totals = months(DATE)) %>%
    summarize(n = n()) %>%
    arrange(match(month_totals, month.name)) %>%
    mutate(month_totals = factor(month_totals, levels = month.name))

# Plot the tibble
month_plot <- ggplot(inc_month_totals, aes(x = month_totals, y = n, fill = month_totals)) +
    geom_col()+
    scale_fill_brewer(palette = 'Paired', direction = 1, name = 'Month') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = 'Shooting Incidents per Month',
        x = 'Month', y = 'Number of Incidents',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')

# Group by hour and total # of incidents
inc_hour_totals <- import_url %>%
    group_by(hour_totals = hour(TIME)) %>%
    summarize(n = n()) %>%
    mutate(hour_totals = factor(hour_totals))

# Plot
hour_plot <- ggplot(inc_hour_totals, aes(x = hour_totals, y = n, fill = hour_totals)) +
    geom_col(show.legend = FALSE) +
    labs(title = 'Shooting Incidents per Hour of the Day',
        x = 'Hour (24H)', y = 'Number of Incidents',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')

# Plot side-by-side
grid.arrange(month_plot, hour_plot, nrow = 1)

Looking at these makes me ask the question, “have shootings become more or less frequent over the years?” Let’s find out what this data suggests next.

inc_year_totals <- import_url %>%
    group_by(year_totals = year(DATE)) %>%
    summarize(n = n()) %>%
    mutate(year_totals = factor(year_totals))

year_plot <- ggplot(inc_year_totals, aes(x = year_totals, y = n, fill = year_totals)) +
    geom_col(show.legend = FALSE) +
    theme(axis.text.x = element_text(angle = 25, hjust = 1)) +
    labs(title = 'Shooting Incidents per Year',
        x = 'Year', y = 'Number of Incidents',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
year_plot

Let’s take a closer look at 2020 and 2021, broken out by month, to see if the pandemic had any noticeable influence. We’ll also include 2018 and 2019 for context.

options(dplyr.summarise.inform = FALSE)
covid_slice <- import_url %>%
    filter(DATE > '2018-01-01') %>%
    mutate(MONTH = stamp('January', orders = '%B', quiet = TRUE)(DATE)) %>%
    mutate(YEAR = stamp('2020', orders = 'y', quiet = TRUE)(DATE))

covid_months_totals <- covid_slice %>%
    group_by(MONTH, YEAR) %>%
    summarize(n = n()) %>%
    arrange(match(MONTH, month.name)) %>%
    mutate(MONTH = factor(MONTH, levels = month.name))

ggplot(covid_months_totals, aes(x = MONTH, y = n, fill = YEAR)) +
    geom_col() +
    scale_fill_brewer(palette = 'Spectral', direction = 1, name = 'Year') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = 'Stacked Barchart Shooting Incidents per Year 2018-2021',
        x = 'Month', y = 'Number of Incidents',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')

Interesting: the chart shows an increase in shooting incidents beginning in May and June of 2020, coming off a downward trend in previous years. Let’s take another view and include the date when New York City initiated its lockdown to see if it aligns with this increase.

covid_slice_2 <- import_url %>%
    filter(DATE > '2018-01-01') %>%
    mutate(DATE = stamp('2020-01', orders = 'ym', quiet = TRUE)(DATE))

covid_months_totals_2 <- covid_slice_2 %>%
    group_by(DATE) %>%
    summarize(n = n())

ggplot(covid_months_totals_2, aes(x = DATE, y = n, fill = DATE)) +
    geom_col(show.legend = FALSE) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
    # x = 27.25 sits just past the 27th bar (March 2020, counting from Jan 2018);
    # NYC's lockdown began 2020-03-22
    geom_vline(xintercept = 27.25) +
    annotate('label', x = 27.25, y = 175, angle = 90, color = 'black', label = 'Covid Pandemic Lockdown Initiated') +
    annotate("rect", xmin = 0, xmax = 27.25, ymin = 0, ymax = 250, alpha = .2, fill = "#00c9d0") +
    annotate("rect", xmin = 27.25, xmax = 48.5, ymin = 0, ymax = 250, alpha = .2, fill = "#c42a07") +
    labs(title = 'Shooting Incidents per Year 2018-2021',
        x = 'Month', y = 'Number of Incidents',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')

There certainly is some underlying correlation here, but from this data alone we can’t say exactly what it is. That said, this is an interesting result: the increase in shootings aligns with the period when the pandemic lockdown in NYC had been in effect for over a month, and it occurs as restrictions would have begun to loosen.
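As a rough check on that alignment, we can compare the mean number of incidents per month before and after the lockdown took effect (2020-03-22, the start of “New York State on PAUSE”). A minimal sketch, assuming import_url is loaded as above:

```r
library(dplyr)
library(lubridate)

# NYC's stay-at-home order took effect 2020-03-22.
lockdown <- as.Date('2020-03-22')

import_url %>%
    filter(DATE >= as.Date('2018-01-01')) %>%
    mutate(PERIOD = if_else(DATE < lockdown, 'Pre-lockdown', 'Post-lockdown')) %>%
    group_by(PERIOD, MONTH = floor_date(DATE, 'month')) %>%
    summarize(n = n(), .groups = 'drop') %>%
    group_by(PERIOD) %>%
    summarize(mean_monthly_incidents = mean(n))
```

This only describes the data; it cannot, of course, establish the lockdown as a cause.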

Incident Coordinate Data Visualized on a Map

For reference here is a map of NYC’s boroughs in the public domain from Wikipedia: https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/5_Boroughs_Labels_New_York_City_Map.svg

reference_map <- 'https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/5_Boroughs_Labels_New_York_City_Map.svg'
knitr::include_graphics(reference_map)

Labels on borough reference map:

  1. Manhattan
  2. Brooklyn
  3. Queens
  4. The Bronx
  5. Staten Island

Here we’re going to visualize the location of each shooting incident using the coordinates given in the dataset. First, we use the minimum and maximum values of the longitudes and latitudes to find the map’s bounding box (edges). Next, get_stamenmap() fetches the map tiles for that box and ggmap() renders them. Finally, geom_point() and stat_density2d_filled() superimpose our data on the map using the same coordinate system.

# Initialize the bounding box that will contain the map view edges
map_bounds <- c(
    left = min(import_url$LONGITUDE, na.rm = TRUE),
    bottom = min(import_url$LATITUDE, na.rm = TRUE),
    right = max(import_url$LONGITUDE, na.rm = TRUE),
    top = max(import_url$LATITUDE, na.rm = TRUE))

# Initialize the map of NYC using map_bounds
# Note, there are better maps out there but most require a private google API key,
# which wouldn't work for this public, knit-able, project.
incident_map_point <- ggmap(get_stamenmap(map_bounds, maptype = 'terrain', zoom = 11)) + 
# Overlay each data point using LONG. and LAT.
    geom_point(data = import_url,
            aes(x = LONGITUDE, y = LATITUDE),
            color = 'darkred',
            size = 0.25,
            alpha = 0.5,
            na.rm = TRUE) +
    ggtitle('Point Plot of NYPD Shooting Incident Reporting 2006 - 2021') +
    labs(x = 'Longitude', y = 'Latitude',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
incident_map_point_color <- ggmap(get_stamenmap(map_bounds, maptype = 'terrain', zoom = 11)) + 
# Overlay each data point using LONG. and LAT.
    geom_point(data = import_url,
        aes(x = LONGITUDE, y = LATITUDE, color = BOROUGH),
        size = 0.25,
        alpha = 0.5,
        na.rm = TRUE) +
    scale_color_brewer(palette = 'Set1', direction = 1, name = 'Borough') +
    theme(legend.position = 'bottom') +
    guides(color = guide_legend(override.aes = list(size = 5, alpha = 1))) +
    ggtitle('Point Plot of NYPD Shooting Incident Reporting 2006 - 2021') +
    labs(x = 'Longitude', y = 'Latitude',
        caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
# Display Point Map
grid.arrange(incident_map_point, incident_map_point_color, ncol = 2)

This looks great and gives a much deeper understanding of the spatial distribution of the shootings than the previous borough plot.

  • Central Park seems to be quite safe from shootings (though this may be due to the methods of recording incidents).
  • The North side of Manhattan is responsible for the majority of the shootings in that borough.
  • Comparatively, the distribution of shootings in Queens is quite evenly spread, and mostly on the South side.
  • There are some hot spots that would not be obvious without the coordinate data, namely around the coasts.

The point map does have limitations, though. Information is lost when many points overlap, and it’s hard to compare the higher-density areas to one another (e.g. we know from the previous graph that Brooklyn has >2000 more incidents than The Bronx, but here you really can’t parse that out). Let’s convert this into a heat/density map to get a better picture of these higher-density areas.

# Initialize density map to better visualize regions with frequent incidents.
incident_map_density <- ggmap(get_stamenmap(map_bounds, maptype = 'terrain', zoom = 11)) + 
    stat_density2d_filled(data = import_url,
        contour_var = 'density',
        aes(x = LONGITUDE, y = LATITUDE, fill = after_stat(level)),
        bins = 20,
        geom = 'polygon',
        alpha = 0.8,
        na.rm = TRUE) +
    geom_density_2d(data = import_url,
        aes(x = LONGITUDE, y = LATITUDE),
        bins = 20,
        alpha = 0.2,
        color = "white",
        na.rm = TRUE) +
    guides(fill = guide_legend(title = "Density")) +
    ggtitle('Density Plot of NYPD Shooting Incident Reporting 2006 - 2021') +
    labs(x = 'Longitude', y = 'Latitude',
    caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
# Display Density Map
incident_map_density

  • Now it’s much more obvious that Brooklyn exhibits more shooting incidents than The Bronx, denoted here by a more intense yellow/green.
  • Though the total amount in Manhattan is lower than Queens, you can see it’s much more concentrated to one area in Manhattan.
  • The masking of the lower density areas highlights some areas we may have overlooked on the scatter plot.

Model

We will now build a model that tries to predict whether a murder will occur, given the total number of shootings that occur on that day and the day of the week. This will also require us to add a few more features.

# Add DAY and YEAR features for easier parsing
import_url$DAY <- factor(wday(import_url$DATE, label = TRUE, abbr = FALSE), ordered = FALSE)
import_url$YEAR <- factor(year(import_url$DATE), ordered = FALSE)
# Add TOTAL_TODAY = the total number of incidents on a given DATE.
import_url <- import_url %>%
    group_by(DATE) %>%
    mutate(TOTAL_TODAY = n()) %>%
    ungroup()
# Model
murder_model <- glm(MURDER_FLAG ~ TOTAL_TODAY - 1 + DAY, data = import_url, family = 'binomial'); murder_model
## 
## Call:  glm(formula = MURDER_FLAG ~ TOTAL_TODAY - 1 + DAY, family = "binomial", 
##     data = import_url)
## 
## Coefficients:
##  TOTAL_TODAY     DAYSunday     DAYMonday    DAYTuesday  DAYWednesday  
##     -0.01334      -1.50816      -1.51111      -1.49267      -1.38652  
##  DAYThursday     DAYFriday   DAYSaturday  
##     -1.44675      -1.46051      -1.54235  
## 
## Degrees of Freedom: 21420 Total (i.e. Null);  21412 Residual
## Null Deviance:       29690 
## Residual Deviance: 19800     AIC: 19820
plot(murder_model)

model_summary <- summary(murder_model); model_summary
## 
## Call:
## glm(formula = MURDER_FLAG ~ TOTAL_TODAY - 1 + DAY, family = "binomial", 
##     data = import_url)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## TOTAL_TODAY  -0.013337   0.006392  -2.087   0.0369 *  
## DAYSunday    -1.508163   0.059421 -25.381   <2e-16 ***
## DAYMonday    -1.511106   0.058155 -25.984   <2e-16 ***
## DAYTuesday   -1.492668   0.059164 -25.229   <2e-16 ***
## DAYWednesday -1.386524   0.058102 -23.863   <2e-16 ***
## DAYThursday  -1.446748   0.059371 -24.368   <2e-16 ***
## DAYFriday    -1.460505   0.057269 -25.502   <2e-16 ***
## DAYSaturday  -1.542353   0.057962 -26.610   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 29694  on 21420  degrees of freedom
## Residual deviance: 19803  on 21412  degrees of freedom
## AIC: 19819
## 
## Number of Fisher Scoring iterations: 4

Looking at the coefficient p-values, it is perhaps a little unsurprising that there is a strong relationship between the day of the week, the total number of shooting incidents that day, and whether or not a murder occurs, but it was worth confirming.
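Since the model is on the logit scale, the coefficients are easier to read once converted to probabilities with plogis(). For example, using the fitted estimates above, the predicted probability that an incident on a Wednesday is a murder, on a day with 5 total incidents, is roughly:

```r
# Convert log-odds from the fitted model to a probability.
# Coefficients are taken from the summary above; with the fitted model
# object, the equivalent would be:
# predict(murder_model,
#         newdata = data.frame(TOTAL_TODAY = 5, DAY = 'Wednesday'),
#         type = 'response')
logit <- -1.386524 + 5 * -0.013337   # DAYWednesday + 5 * TOTAL_TODAY
plogis(logit)                        # inverse-logit: exp(x) / (1 + exp(x)), about 0.19
```

So the model assigns roughly a 19% probability of a murder flag under those conditions.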

Conclusion

Even features that initially seemed sparse, considering the number of missing entries, led to valuable insights into this data.

Some takeaways we’ve seen from above:

  1. While the overall trend in the number of shooting incidents was downward from 2006 to the beginning of 2020, incidents increased sharply shortly after the Covid pandemic began.
  2. The majority of shootings in the dataset occurred in Brooklyn and The Bronx, and near public and private housing.
  3. Even in low-incident areas there are identifiable hotspots where most shootings occur (e.g. Staten Island, though low on the totals scale, shows ~5 localized clusters of incidents).
  4. While women make up ~8.18% of the victims in the cases where the sex was known, they only make up ~2.03% of the perpetrators where the sex was known.
  5. The north side of Manhattan accounts for a large portion of the incidents in that borough.
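For reference, the percentages in point 4 can be reproduced with a tally like the following. This is a minimal sketch that assumes the dataset’s original VIC_SEX and PERP_SEX columns (values 'M', 'F', or an unknown code) are still present; if they were renamed during cleaning, substitute the current names.

```r
# Percent female among records where sex is known ('M' or 'F').
sex_share <- function(x) {
    known <- x[x %in% c('M', 'F')]   # drop unknown / missing codes
    mean(known == 'F') * 100         # percent female
}

# Usage (column names assumed from the source dataset):
# sex_share(import_url$VIC_SEX)    # victims
# sex_share(import_url$PERP_SEX)   # perpetrators
```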

Bias

Much of the bias of this analysis comes from choosing which features to include and what conclusions to draw. My experience living in two different metropolitan cities for half my life, and my past experiences with police officers, could also affect the lens through which I view this information. Also, ever the optimist, I assumed shootings would have decreased during the pandemic, but the data tells another story. In an attempt to mitigate these biases I’ve chosen to include only what I believe the data empirically shows, and not to obfuscate any of the methods used to obtain these results.

Another likely source of bias is how the data was collected, reviewed, and updated. We don’t know the exact methods used here, and the data likely reflects the efforts of hundreds of police officers over the years, each working with their own biases. These are all things to keep in mind when working with datasets like this and drawing conclusions from them. Hopefully the conclusions made here are reproducible and clear.

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin22.4.0 (64-bit)
## Running under: macOS Ventura 13.2.1
## 
## Matrix products: default
## BLAS:   /opt/homebrew/Cellar/openblas/0.3.23/lib/libopenblasp-r0.3.23.dylib 
## LAPACK: /opt/homebrew/Cellar/r/4.3.0/lib/R/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gridExtra_2.3   ggmap_3.0.2     lubridate_1.9.2 forcats_1.0.0  
##  [5] stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1     readr_2.1.4    
##  [9] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.5          utf8_1.2.3          generics_0.1.3     
##  [4] bitops_1.0-7        jpeg_0.1-10         stringi_1.7.12     
##  [7] lattice_0.21-8      hms_1.1.3           digest_0.6.31      
## [10] magrittr_2.0.3      RColorBrewer_1.1-3  evaluate_0.20      
## [13] grid_4.3.0          timechange_0.2.0    fastmap_1.1.1      
## [16] plyr_1.8.8          jsonlite_1.8.4      httr_1.4.5         
## [19] fansi_1.0.4         viridisLite_0.4.1   scales_1.2.1       
## [22] isoband_0.2.7       jquerylib_0.1.4     cli_3.6.1          
## [25] rlang_1.1.0         munsell_0.5.0       withr_2.5.0        
## [28] cachem_1.0.7        yaml_2.3.7          tools_4.3.0        
## [31] tzdb_0.3.0          colorspace_2.1-0    curl_5.0.0         
## [34] vctrs_0.6.2         R6_2.5.1            png_0.1-8          
## [37] lifecycle_1.0.3     MASS_7.3-59         pkgconfig_2.0.3    
## [40] pillar_1.9.0        bslib_0.4.2         gtable_0.3.3       
## [43] Rcpp_1.0.10         glue_1.6.2          RgoogleMaps_1.4.5.3
## [46] highr_0.10          xfun_0.39           tidyselect_1.2.0   
## [49] knitr_1.42          farver_2.1.1        htmltools_0.5.5    
## [52] labeling_0.4.2      rmarkdown_2.21      compiler_4.3.0     
## [55] sp_1.6-0